Red Wine Quality Data Analysis

Table of contents

  • Project Brief
  • Business Stakeholders and Objectives
    • Stakeholder Identification
    • Stakeholder Objectives
  • Notebook Initialization and Exploratory Data Analysis (EDA)
    • Setting up the Coding Environment
      • Importing Libraries
      • Importing Functions
    • Loading the Data
      • Data Brief
      • Cleaning the Data
    • Data Exploration Overview
      • Preliminary Plan for Data Exploration
      • Data Exploration
        • Data Distribution
        • Statistical Inference & Model Fitting
  • Conclusions
    • Insights and Findings
    • Recommendations for Action
    • Further Areas for Investigation

Project Brief

The dataset for this project is the Red Wine Quality dataset, sourced from the reference [Cortez et al., 2009]. Because of confidentiality and logistical constraints, only data on physicochemical properties (inputs) and sensory characteristics (the output) are accessible. For instance, there is no information on the type of grapes, the brand of the wine, or its selling price.


Business Stakeholders and Objectives

Stakeholder Identification

  • Wine Chemists: These are professionals who analyze the chemical composition of wine and its ingredients. They would be interested in understanding how different physicochemical properties influence the quality of wine.
  • Wine Critics/Reviewers: Their reviews and ratings can significantly influence the perception of the wine's quality in the market.
  • Wine Production Managers: These are individuals who oversee the wine production process. They would be interested in any findings that could help improve the quality of the wine they produce.
  • Supply Chain Managers: They manage the sourcing of grapes and other materials. Insights from the data can help optimize the supply chain.

Stakeholder Objectives

  • Wine Chemists: To understand the relationship between the chemical composition of wine and its perceived quality.
  • Wine Critics/Reviewers: To provide accurate and reliable reviews for consumers and industry stakeholders.
  • Wine Production Managers: To improve the quality of the wine they produce based on scientific insights.
  • Supply Chain Managers: To ensure a steady supply of high-quality grapes and other materials while minimizing costs.

Notebook Initialization and Exploratory Data Analysis (EDA)

Setting up the Coding Environment

Importing Libraries

In [1]:
# %pip install jupyter pandas numpy scipy statsmodels mord plotly scikit-learn
In [2]:
import sys
from IPython.display import Image, display

import pandas as pd
import numpy as np

import scipy.stats as stats
from scipy.stats import binom, mannwhitneyu, kendalltau, spearmanr
import statsmodels.api as sm
import mord

import plotly
import plotly.express as px
import plotly.io as pio
import plotly.graph_objects as go
import plotly.subplots as sp

pio.renderers.default = "notebook"
plotly.offline.init_notebook_mode()

from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.utils import resample

Importing Functions

In [3]:
from utils.functions import (
    print_missing_and_duplicates,
    generate_charts,
    generate_histograms,
    bootstrap_ci,
)
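
The helper functions above live in the project's `utils` module and their implementations are not shown in this notebook. As a purely hypothetical sketch, `bootstrap_ci` might compute a percentile bootstrap confidence interval for the sample mean along these lines:

```python
import numpy as np

def bootstrap_ci(data, n_boot=1_000, ci=0.95, seed=42):
    """Hypothetical sketch: percentile bootstrap CI for the sample mean."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    # Resample with replacement n_boot times and record each sample mean.
    means = np.array([
        rng.choice(data, size=len(data), replace=True).mean()
        for _ in range(n_boot)
    ])
    alpha = (1 - ci) / 2
    # The central ci-fraction of bootstrap means gives the interval.
    return float(np.quantile(means, alpha)), float(np.quantile(means, 1 - alpha))
```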


Loading the Data

In [4]:
dm = pd.read_csv("./database/winequality-red.csv")

Data Brief

The data consists of one table, with the following columns: fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol, quality. All columns describe physicochemical features of each wine sample, plus its sensory quality score.

In [5]:
print(dm.shape)
print(dm.columns)
dm.head()
(1599, 12)
Index(['fixed acidity', 'volatile acidity', 'citric acid', 'residual sugar',
       'chlorides', 'free sulfur dioxide', 'total sulfur dioxide', 'density',
       'pH', 'sulphates', 'alcohol', 'quality'],
      dtype='object')
Out[5]:
fixed acidity volatile acidity citric acid residual sugar chlorides free sulfur dioxide total sulfur dioxide density pH sulphates alcohol quality
0 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
1 7.8 0.88 0.00 2.6 0.098 25.0 67.0 0.9968 3.20 0.68 9.8 5
2 7.8 0.76 0.04 2.3 0.092 15.0 54.0 0.9970 3.26 0.65 9.8 5
3 11.2 0.28 0.56 1.9 0.075 17.0 60.0 0.9980 3.16 0.58 9.8 6
4 7.4 0.70 0.00 1.9 0.076 11.0 34.0 0.9978 3.51 0.56 9.4 5
In [6]:
unique_quality = dm["quality"].unique()
count_quality = len(unique_quality)
top_quality = dm["quality"].mode().values[0]
top_quality_freq = dm["quality"].value_counts().max()
total_quality = dm["quality"].count()

print("Unique Quality Ratings:", unique_quality)
print("Count of Unique Quality Ratings:", count_quality)
print("Most Common Quality Rating (Top):", top_quality)
print("Frequency of Most Common Quality Rating:", top_quality_freq)
print("Total Number of Quality Ratings:", total_quality)
Unique Quality Ratings: [5 6 7 4 8 3]
Count of Unique Quality Ratings: 6
Most Common Quality Rating (Top): 5
Frequency of Most Common Quality Rating: 681
Total Number of Quality Ratings: 1599

Cleaning the Data

Checking for duplicates and missing values

Missing Values and Duplicates Across Columns
JUSTIFICATION FOR PROCESSING:
  • Missing values in a dataset can lead to inaccurate or misleading statistics and machine learning model predictions. They can occur due to various reasons such as data entry errors, failure to collect information, etc. Depending on the nature and extent of these missing values, different strategies can be employed to handle them.
  • Duplicate values in a dataset can occur due to various reasons such as data entry errors, merging of datasets, etc. Duplicates can lead to biased or incorrect results in data analysis. Therefore, it’s important to identify and remove duplicates.
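
`print_missing_and_duplicates` is a project-specific helper from `utils.functions`; its source is not shown here. A minimal hypothetical sketch of what such a helper might do (not the actual implementation):

```python
import pandas as pd

def print_missing_and_duplicates(df: pd.DataFrame) -> None:
    """Hypothetical sketch: report missing values and duplicate-row count."""
    missing = df.isnull().sum()
    if missing.sum() == 0:
        print("Data has no missing values.")
    else:
        print("Missing values per column:")
        print(missing[missing > 0])
    print("\nDuplicates:")
    print(df.duplicated().sum())
```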
In [7]:
print_missing_and_duplicates(dm)
Data has no missing values.

Duplicates:
240
Drop Duplicate Rows:
In [8]:
dm = dm.drop_duplicates()

OUTCOMES:

  • No missing values were found in the dataset.
  • 240 duplicate rows were found and subsequently dropped, leaving 1,359 observations.



Data Exploration Overview

Scrutinizing the dataset to identify key patterns, relationships, and trends. This process aids in detecting significant variables and anomalies, leading to more accurate predictions and insights.

Preliminary Plan for Data Exploration

Data Exploration

  1. Data Distribution:

    • Box Plot Analysis of Wine Features
    • Histogram Analysis of Wine Features
    • Predicting Wine Quality
      • Correlation Heatmap of Physicochemical Wine Features
      • Linear Regression Model & Statistical Significance of Features
      • R-squared and Information Criteria
      • Automated Feature Selection
    • Predicting Alcohol Level
      • Linear Regression Model & Statistical Significance of Features
      • R-squared and Information Criteria
  2. Statistical Inference & Model Fitting:

    • Comparison of Wine Quality Across Alcohol Levels
    • Influence of Alcohol Level on Wine Quality
    • Influence of Sulphates Level on Wine Quality
    • Influence of Volatile Acidity Level on Wine Quality

Data Exploration

Data Distribution Check
JUSTIFICATION FOR PROCESSING: Examining the distribution of each variable reveals trends, skewness, and potential outliers across the dataset.
In [9]:
fig_height = 460
fig_width = fig_height * 3

columns_to_plot = [
    "fixed acidity",
    "volatile acidity",
    "citric acid",
    "residual sugar",
    "chlorides",
    "free sulfur dioxide",
    "total sulfur dioxide",
    "density",
    "pH",
    "sulphates",
    "alcohol",
    "quality",
]

fig = generate_charts(dm, columns_to_plot, fig_height, fig_width)
fig.show()
In [10]:
for column in dm.columns:
    if dm[column].dtype in [float, int]:
        shapiro_test = stats.shapiro(dm[column])

        print(f"Column: {column}")
        print(
            f"Test statistic: {shapiro_test.statistic}, p-value: {shapiro_test.pvalue}"
        )

        if shapiro_test.pvalue > 0.05:
            print("Conclusion: Normally distributed")
        else:
            print("Conclusion: Not normally distributed")

        print("-------------------------------------------")
Column: fixed acidity
Test statistic: 0.9468387961387634, p-value: 9.900713775341854e-22
Conclusion: Not normally distributed
-------------------------------------------
Column: volatile acidity
Test statistic: 0.9701831340789795, p-value: 3.9276597479848726e-16
Conclusion: Not normally distributed
-------------------------------------------
Column: citric acid
Test statistic: 0.9555168151855469, p-value: 6.565239464014468e-20
Conclusion: Not normally distributed
-------------------------------------------
Column: residual sugar
Test statistic: 0.5767284631729126, p-value: 0.0
Conclusion: Not normally distributed
-------------------------------------------
Column: chlorides
Test statistic: 0.4844779968261719, p-value: 0.0
Conclusion: Not normally distributed
-------------------------------------------
Column: free sulfur dioxide
Test statistic: 0.903224766254425, p-value: 1.7512816818347188e-28
Conclusion: Not normally distributed
-------------------------------------------
Column: total sulfur dioxide
Test statistic: 0.8716893196105957, p-value: 5.342316783703258e-32
Conclusion: Not normally distributed
-------------------------------------------
Column: density
Test statistic: 0.992393434047699, p-value: 1.8121114635505364e-06
Conclusion: Not normally distributed
-------------------------------------------
Column: pH
Test statistic: 0.9927249550819397, p-value: 3.0900184810889186e-06
Conclusion: Not normally distributed
-------------------------------------------
Column: sulphates
Test statistic: 0.8302361965179443, p-value: 1.0596848396598458e-35
Conclusion: Not normally distributed
-------------------------------------------
Column: alcohol
Test statistic: 0.9268006086349487, p-value: 3.244671431354563e-25
Conclusion: Not normally distributed
-------------------------------------------
In [11]:
fig_height = 460
fig_width = fig_height * 3

columns_to_plot = dm.drop(columns=["quality"]).columns.tolist()

fig = generate_histograms(dm, columns_to_plot, fig_height, fig_width)
fig.show()

Predicting Wine Quality

We are interested in trying to predict the quality column using the remaining 11 features of wine.

Correlation Analysis

Knowing the distribution and the nature of the data, we will investigate which physical and chemical characteristics of the wine correlate with its quality (the dependent variable). To this end we will plot a correlation heatmap to explore linear relationships within the data.

In [12]:
correlation_matrix = dm.corr().round(2)
correlation_array = correlation_matrix.to_numpy()

fig = px.imshow(
    correlation_matrix,
    labels=dict(color="Correlation"),
    x=correlation_matrix.index,
    y=correlation_matrix.columns,
    text_auto=True,
    color_continuous_scale="purples",
)
fig.update_layout(
    title="Correlation Heatmap", template="plotly_dark", width=800, height=600
)

fig.show()

Linear Regression Model & Statistical Significance of Features

We'll be fitting a basic linear regression model to predict the wine quality (dm['quality']) using the other features (X).

In [13]:
X = dm.drop("quality", axis=1)
y = dm["quality"]

X = sm.add_constant(X)

model = sm.OLS(y, X)
results = model.fit()

print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                quality   R-squared:                       0.364
Model:                            OLS   Adj. R-squared:                  0.359
Method:                 Least Squares   F-statistic:                     70.02
Date:                Sat, 20 Apr 2024   Prob (F-statistic):          5.83e-124
Time:                        16:20:19   Log-Likelihood:                -1356.8
No. Observations:                1359   AIC:                             2738.
Df Residuals:                    1347   BIC:                             2800.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                   13.2379     23.522      0.563      0.574     -32.906      59.381
fixed acidity            0.0126      0.029      0.434      0.664      -0.044       0.069
volatile acidity        -1.1204      0.130     -8.593      0.000      -1.376      -0.865
citric acid             -0.1642      0.162     -1.015      0.310      -0.482       0.153
residual sugar           0.0071      0.017      0.419      0.675      -0.026       0.040
chlorides               -1.9303      0.448     -4.304      0.000      -2.810      -1.050
free sulfur dioxide      0.0033      0.002      1.397      0.163      -0.001       0.008
total sulfur dioxide    -0.0027      0.001     -3.394      0.001      -0.004      -0.001
density                 -8.9904     24.002     -0.375      0.708     -56.075      38.094
pH                      -0.4585      0.213     -2.155      0.031      -0.876      -0.041
sulphates                0.9147      0.127      7.202      0.000       0.666       1.164
alcohol                  0.2895      0.029      9.876      0.000       0.232       0.347
==============================================================================
Omnibus:                       26.019   Durbin-Watson:                   1.785
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               37.103
Skew:                          -0.203   Prob(JB):                     8.77e-09
Kurtosis:                       3.701   Cond. No.                     1.15e+05
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.15e+05. This might indicate that there are
strong multicollinearity or other numerical problems.

R-squared and Information Criteria

We'll be evaluating model performance using R-squared (R^2) to understand the proportion of variance explained. Additionally, we'll use popular information criteria such as AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) to assess model fit and complexity.

In [14]:
print("R-squared:", results.rsquared)

print("AIC:", results.aic)
print("BIC:", results.bic)
R-squared: 0.36379974128919557
AIC: 2737.522148225965
BIC: 2800.0962011957786

Automated Feature Selection

We additionally carry out automated feature selection to estimate the importance of each feature for wine quality. Unlike linear regression, a tree-based model such as gradient boosting can capture complex, non-linear relationships between features, which appear to be present in the wine quality dataset.

In [15]:
X = dm.drop("quality", axis=1)
y = dm["quality"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

gbm_model = GradientBoostingRegressor(random_state=42)
gbm_model.fit(X_train, y_train)

gbm_y_pred = gbm_model.predict(X_test)

gbm_mse = mean_squared_error(y_test, gbm_y_pred)
print(f"Gradient Boosting MSE: {gbm_mse}")

importances = gbm_model.feature_importances_
feature_importances = pd.Series(importances, index=X.columns).sort_values(
    ascending=False
)
print(feature_importances)
Gradient Boosting MSE: 0.38290797991756037
alcohol                 0.376684
sulphates               0.161644
volatile acidity        0.127221
total sulfur dioxide    0.070908
chlorides               0.051873
pH                      0.045436
fixed acidity           0.045400
density                 0.041487
free sulfur dioxide     0.029909
residual sugar          0.028848
citric acid             0.020589
dtype: float64

Predicting Alcohol Level

We are interested in trying to predict the alcohol column using the remaining 11 features of wine.

Linear Regression Model & Statistical Significance of Features

We'll be fitting a basic linear regression model to predict the wine alcohol level (dm['alcohol']) using the other features (X).

In [16]:
target_column = "alcohol"

predictors = dm.drop(target_column, axis=1)

target = dm[target_column]

X_train, X_test, y_train, y_test = train_test_split(
    predictors, target, test_size=0.2, random_state=42
)
In [17]:
X_train = sm.add_constant(X_train)
X_test = sm.add_constant(X_test)

model = sm.OLS(y_train, X_train)
results = model.fit()

print(results.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                alcohol   R-squared:                       0.708
Model:                            OLS   Adj. R-squared:                  0.705
Method:                 Least Squares   F-statistic:                     236.5
Date:                Sat, 20 Apr 2024   Prob (F-statistic):          7.53e-278
Time:                        16:20:20   Log-Likelihood:                -968.63
No. Observations:                1087   AIC:                             1961.
Df Residuals:                    1075   BIC:                             2021.
Df Model:                          11                                         
Covariance Type:            nonrobust                                         
========================================================================================
                           coef    std err          t      P>|t|      [0.025      0.975]
----------------------------------------------------------------------------------------
const                  573.6491     16.096     35.639      0.000     542.066     605.232
fixed acidity            0.5158      0.025     20.763      0.000       0.467       0.565
volatile acidity         0.5275      0.136      3.887      0.000       0.261       0.794
citric acid              0.8076      0.161      5.032      0.000       0.493       1.123
residual sugar           0.2624      0.015     17.848      0.000       0.234       0.291
chlorides               -0.6024      0.438     -1.375      0.170      -1.462       0.258
free sulfur dioxide     -0.0023      0.002     -0.935      0.350      -0.007       0.003
total sulfur dioxide    -0.0019      0.001     -2.277      0.023      -0.003      -0.000
density               -584.8609     16.461    -35.530      0.000    -617.161    -552.561
pH                       3.7643      0.187     20.085      0.000       3.397       4.132
sulphates                0.9255      0.126      7.332      0.000       0.678       1.173
quality                  0.2456      0.026      9.350      0.000       0.194       0.297
==============================================================================
Omnibus:                       72.996   Durbin-Watson:                   1.943
Prob(Omnibus):                  0.000   Jarque-Bera (JB):              127.316
Skew:                           0.483   Prob(JB):                     2.26e-28
Kurtosis:                       4.371   Cond. No.                     7.75e+04
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.75e+04. This might indicate that there are
strong multicollinearity or other numerical problems.

R-squared and Information Criteria

We'll be evaluating model performance using R-squared (R^2) to understand the proportion of variance explained.

In [18]:
y_pred = results.predict(X_test)

r2 = r2_score(y_test, y_pred)
print(f"R-squared: {r2}")
R-squared: 0.679178464873085

OUTCOMES:

  • From both the plots and the tables, the rows identified as outliers can be clearly seen. However, these outliers represent valid extreme observations, so the decision is to keep them for the EDA.
  • The distributions closest to normal (though still not normal at a significance level of 0.05) are those of pH and density. The remaining variables are more skewed and asymmetric.
  • 0.67 (Positive Correlation) can be seen between `fixed acidity` and `citric acid`, between `fixed acidity` and `density`, and between `total sulfur dioxide` and `free sulfur dioxide`. This indicates a moderately strong positive correlation: a relatively predictable relationship between the variables, but not a perfect one.
  • -0.69 (Negative Correlation) can be seen between `fixed acidity` and `pH`. This indicates a moderately strong negative correlation, again suggesting a relatively predictable but imperfect relationship between the variables.
  • Upon examining the heatmap, none of the wine properties shows a strong linear relationship with the quality rating. The property with the highest correlation is alcohol content, at 0.48, suggesting a trend where higher alcohol content tends to accompany higher quality. However, this correlation is only moderate.
  • The linear regression model explains 36.4% of the variance in wine quality (R^2 = 0.364), with significant predictors including volatile acidity, chlorides, total sulfur dioxide, pH, sulphates, and alcohol. Coefficients indicate that increases in volatile acidity decrease predicted quality, while higher sulphates and alcohol content are associated with better quality. Model evaluation using AIC (2738.0) and BIC (2800.0) suggests a trade-off between fit and complexity, emphasizing the importance of feature selection and model refinement.
  • The linear regression model explains 70.8% of the variance in wine alcohol level (R^2 = 0.708), with significant predictors including fixed acidity, volatile acidity, citric acid, residual sugar, density, pH, sulphates, and quality (all with p-values < 0.05), suggesting that changes in these features are likely to impact alcohol levels. Coefficients indicate that higher density is associated with substantially lower predicted alcohol, while higher fixed acidity, residual sugar, pH, and quality are associated with higher alcohol. The presence of multicollinearity or other numerical problems is suggested by the large condition number (7.75e+04), which may affect the reliability of individual coefficient estimates. Further diagnostics may be needed to address these issues and ensure the robustness of the model.
  • We also explored the significance of various characteristics that could potentially influence wine quality by fitting a statistical model designed to account for non-linear relationships, which are not captured by the heatmap. The Gradient Boosting model yielded the best performance with a Mean Squared Error (MSE) of 0.383, and hence its results were primarily considered for further analysis. Other models, such as the Random Forest Regressor (MSE: 0.384) and Support Vector Machines (MSE: 0.91), were also evaluated but performed worse than the Gradient Boosting model. The test results indicate that the alcohol content of the wine could potentially influence its quality. Other significant factors include `sulphates`, albeit with less than half of alcohol's importance score, and `volatile acidity`.
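
The model comparison described above can be reproduced in outline. The sketch below uses synthetic stand-in data so it runs standalone (the notebook itself would use `X = dm.drop("quality", axis=1)` and `y = dm["quality"]`), so the MSE values will differ from those reported:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Synthetic stand-in data with the same shape of problem as the wine dataset.
rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(300, 4)), columns=["f1", "f2", "f3", "f4"])
y = 5 + 0.5 * X["f1"] + rng.normal(scale=0.3, size=300)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit each candidate model on the same split and compare test-set MSE.
results = {}
for name, model in [
    ("Random Forest", RandomForestRegressor(random_state=42)),
    ("SVR", SVR()),
]:
    model.fit(X_train, y_train)
    results[name] = mean_squared_error(y_test, model.predict(X_test))
    print(f"{name} MSE: {results[name]:.4f}")
```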

Statistical Inference & Model Fitting
The goal of statistical inference for this red wine quality EDA is to study the impact of wine parameters on the distribution of quality scores. Differences in physicochemical properties across quality groups will be examined.

Comparison of Wine Quality Across Alcohol Levels

Given the nature of the analyzed data, with quality being an ordinal variable and alcohol being continuous and not normally distributed, a non-parametric Kruskal-Wallis test was conducted to analyze whether there are significant differences in alcohol levels across quality categories.

Target Population:

The target population consists of all wine types available in the dataset.

Significance Levels:
The chosen significance level for hypothesis testing was α = 0.05.

Confidence Intervals:
The confidence intervals for the count (frequency) of alcohol levels within wine quality groups are as follows: 5: (0.0, 2.0), 6: (0.0, 2.0), 7: (0.0, 1.0), 4: (0.0, 1.0), 8: (0.0, 0.0), 3: (0.0, 0.0)

Each interval provides a range of plausible values for the true population parameter, which is the count of alcohol levels for wine types within the corresponding quality groups, with a specified level of confidence of 95%. For example, in the lowest quality group (quality score of 3), we can be 95% confident that the true count of alcohol levels falls within the interval (0.0, 0.0). These confidence intervals help to assess the variability in alcohol levels across different wine quality categories.

In [19]:
data = dm["quality"]
counts = data.value_counts()
confidence_level = 0.95

confidence_intervals = {}
for category, count in counts.items():
    lower_bound, upper_bound = binom.interval(
        confidence_level, n=count, p=1 / len(data)
    )
    confidence_intervals[category] = (lower_bound, upper_bound)

print("Confidence Intervals for the Count:", confidence_intervals)
Confidence Intervals for the Count: {5: (0.0, 2.0), 6: (0.0, 2.0), 7: (0.0, 1.0), 4: (0.0, 1.0), 8: (0.0, 0.0), 3: (0.0, 0.0)}

Statistical Hypotheses:
Null Hypothesis (H0): There is no significant difference in alcohol levels across quality categories. Alternative Hypothesis (H1): There are significant differences in alcohol levels across quality categories.

Hypothesis Testing:
To investigate the differences in alcohol levels across quality categories of wines, the Kruskal-Wallis test was conducted. This choice was made because the alcohol distribution is not normal and wine quality is an ordinal variable, violating the assumptions of parametric alternatives such as one-way ANOVA; the Kruskal-Wallis test is suitable for comparing distributions across multiple groups under these conditions. The resulting test statistic of 368.56, coupled with an extremely small p-value (approximately 1.77e-77), provides compelling evidence against the null hypothesis, indicating significant differences in alcohol levels across quality categories of wines.

In [20]:
h_statistic, p_value = stats.kruskal(
    *[group["alcohol"] for name, group in dm.groupby("quality")]
)

print("Kruskal-Wallis H statistic:", h_statistic)
print("p-value:", p_value)

alpha = 0.05
if p_value < alpha:
    print(
        "Reject the null hypothesis: There are significant differences in alcohol levels across quality categories."
    )
else:
    print(
        "Fail to reject the null hypothesis: There is no significant difference in alcohol levels across quality categories."
    )
Kruskal-Wallis H statistic: 368.55570664504876
p-value: 1.7670505210573303e-77
Reject the null hypothesis: There are significant differences in alcohol levels across quality categories.

Model Fitting:

After conducting the hypothesis testing and finding significant differences in alcohol levels across wine quality categories, we will now fit a logistic regression model to further investigate these relationships.

The logistic regression model is a suitable choice here because our target variable 'quality' is ordinal. We are using the LogisticAT function from the mord package, which is designed for ordinal regression tasks like this one.

In [21]:
X = dm[["alcohol"]]
y = dm["quality"]

model = mord.LogisticAT()
model.fit(X, y)

print(model.coef_)
[0.97852509]

Conclusions:

  • The statistical analyses and the constructed confidence intervals lead us to reject the null hypothesis. We can confidently state that significant differences exist in alcohol levels across different wine quality categories.
  • The ordinal logistic regression model reveals that each additional unit of alcohol increases the log-odds of the wine being classified in a higher quality category by approximately 0.979. This finding aligns with our earlier results from the Kruskal-Wallis test, further emphasizing the significant differences in alcohol levels across wine quality categories.
  • It's important to note that while these findings are statistically significant, the practical significance and predictive power of the model will be evaluated in subsequent steps using more comprehensive statistical packages. This will ensure the robustness of our findings and their applicability in real-world scenarios.
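
To make the log-odds interpretation concrete, the LogisticAT coefficient reported above can be converted to an odds ratio:

```python
import math

# LogisticAT coefficient for alcohol, from the fit above.
coef = 0.97852509
odds_ratio = math.exp(coef)  # exp(0.9785) is roughly 2.66
print(f"Each extra unit of alcohol multiplies the odds of a higher "
      f"quality category by about {odds_ratio:.2f}")
```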

Influence of Alcohol Level on Wine Quality

Given the nature of the analyzed data, with quality being an ordinal variable and alcohol being continuous and not normally distributed, the non-parametric Kendall's Tau correlation coefficient was computed to analyze whether alcohol level significantly influences wine quality.

Target Population:

The target population consists of all wine types available in the dataset.

Significance Levels:
The chosen significance level for hypothesis testing was α = 0.05.

Statistical Hypotheses:
Null Hypothesis (H0): Alcohol levels do not affect wine quality.
Alternative Hypothesis (H1): Alcohol levels positively/negatively affect wine quality.

Hypothesis Testing:
To assess the association between alcohol content and wine quality, Kendall's Tau correlation coefficient was calculated. The resulting coefficient of 0.388 indicates a moderate positive correlation between alcohol content and wine quality. Furthermore, the p-value associated with this coefficient is extremely small (approximately $1.11 \times 10^{-74}$), well below conventional significance thresholds.

In [22]:
tau, p_value = kendalltau(dm["alcohol"], dm["quality"])
print("Kendall's Tau correlation coefficient:", tau)
print("p-value:", p_value)

alpha = 0.05
if p_value < alpha:
    print(
        "Reject the null hypothesis: There is a significant association between alcohol and quality."
    )
else:
    print(
        "Fail to reject the null hypothesis: There is no significant association between alcohol and quality."
    )
Kendall's Tau correlation coefficient: 0.38762105392452784
p-value: 1.107963287890286e-74
Reject the null hypothesis: There is a significant association between alcohol and quality.

Model Fitting:

Building on the results of our hypothesis testing, which indicated a significant influence of alcohol levels on wine quality, we will now fit an Ordinary Least Squares (OLS) regression model to further investigate this relationship.

The OLS regression model is used here by treating the ordinal target variable 'quality' as numeric. We are using the OLS function from the statsmodels package, which is designed for linear regression tasks like this one.

In the upcoming code cell, we will fit the model, calculate the confidence intervals for our model parameters, and print a summary of our regression results. This will be followed by a scatter plot with a fitted OLS regression line to visually represent the relationship between alcohol content and wine quality.

In [23]:
X = sm.add_constant(dm["alcohol"])
y = dm["quality"]

model = sm.OLS(y, X).fit()

conf_intervals = model.conf_int()

print(conf_intervals)
print(model.summary())
                0         1
const    1.436375  2.182084
alcohol  0.330047  0.401147
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                quality   R-squared:                       0.231
Model:                            OLS   Adj. R-squared:                  0.230
Method:                 Least Squares   F-statistic:                     407.0
Date:                Sat, 20 Apr 2024   Prob (F-statistic):           2.28e-79
Time:                        16:20:21   Log-Likelihood:                -1485.8
No. Observations:                1359   AIC:                             2976.
Df Residuals:                    1357   BIC:                             2986.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          1.8092      0.190      9.519      0.000       1.436       2.182
alcohol        0.3656      0.018     20.174      0.000       0.330       0.401
==============================================================================
Omnibus:                       40.387   Durbin-Watson:                   1.764
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               74.401
Skew:                          -0.210   Prob(JB):                     6.98e-17
Kurtosis:                       4.067   Cond. No.                         103.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [24]:
fig = px.scatter(
    dm,
    x="alcohol",
    y="quality",
    trendline="ols",
    title="Alcohol Content vs. Wine Quality",
)
fig.update_traces(marker=dict(opacity=0.5))
fig.update_layout(xaxis_title="Alcohol", yaxis_title="Quality", template="plotly_dark")
fig.show()

Confidence Intervals:
Intercept (const): If the alcohol content of a wine is zero (hypothetically), we estimate that the quality of the wine is likely to be between 1.436 and 2.182 units with 95% confidence.
Slope (alcohol): For each additional unit increase in alcohol content, we estimate that the quality of the wine is likely to increase between approximately 0.330 and 0.401 units with 95% confidence.

Conclusions:

  • The obtained p-value of $1.11 \times 10^{-74}$ provides compelling evidence to reject the null hypothesis. Therefore, we conclude that there is a statistically significant association between alcohol content and wine quality.
  • The Kendall's Tau correlation test therefore supports a meaningful, statistically significant relationship between alcohol content and wine quality.
  • Comparison with Previous Testing (Kruskal-Wallis Test):

In contrast to the Kruskal-Wallis test previously conducted to examine differences in alcohol levels across wine quality categories, Kendall's Tau correlation test directly evaluates the strength and direction of association between the continuous variable (alcohol content) and the ordinal variable (wine quality). The significant p-value obtained from the Kendall's Tau test indicates a robust relationship between these variables, supporting the conclusion of a meaningful association.

  • The OLS regression analysis provides evidence of a statistically significant and positive relationship between alcohol content and wine quality. Approximately 23.1% of the variability in wine quality can be explained by variations in alcohol content. These findings contribute valuable insights for understanding and predicting wine quality based on alcohol content.
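The contrast drawn above can be illustrated on synthetic stand-in data (not the wine dataset): the Kruskal-Wallis test asks whether alcohol distributions differ across quality groups, while Kendall's Tau measures the monotonic association between the two variables directly.

```python
import numpy as np
from scipy.stats import kendalltau, kruskal

# Synthetic stand-in: integer quality scores with an alcohol-like covariate
rng = np.random.default_rng(42)
quality = rng.integers(3, 9, 300)
alcohol = 9.0 + 0.5 * quality + rng.normal(0.0, 1.0, 300)

# Kruskal-Wallis: do alcohol distributions differ between quality groups?
groups = [alcohol[quality == q] for q in np.unique(quality)]
h_stat, p_kw = kruskal(*groups)

# Kendall's Tau: strength and direction of the monotonic association
tau, p_tau = kendalltau(alcohol, quality)
print(f"Kruskal-Wallis p = {p_kw:.3g}; Kendall tau = {tau:.2f} (p = {p_tau:.3g})")
```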

Influence of Sulphates Level on Wine Quality

Given the nature of the analyzed data, with quality being an ordinal variable and sulphates being continuous and not normally distributed, a non-parametric Spearman rank correlation test was conducted to analyze whether sulphates levels significantly influence wine quality in the dataset.

Target Population:

The target population consists of all wine types available in the dataset.

Significance Levels:

The chosen significance level for hypothesis testing was α = 0.05.

Confidence Intervals:

Since the data are not normally distributed, we use resampling to describe them. The code below draws a single bootstrap resample and takes its 2.5th and 97.5th percentiles, which approximates the central 95% range of individual sulphates values: roughly 95% of wines have sulphates levels between 0.44 and 1.03. Note that this is a percentile interval of the data, not a confidence interval for the mean; a confidence interval for the mean would require repeatedly resampling and recomputing the mean of each resample.

In [25]:
boot = resample(dm["sulphates"], replace=True, n_samples=1000, random_state=1)
boot_ci = np.percentile(boot, [2.5, 97.5])  # central 95% range of the resampled values

print(f"The central 95% range of sulphates values is {boot_ci}")
The central 95% range of sulphates values is [0.44    1.03025]
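A percentile-bootstrap confidence interval for the mean recomputes the statistic on many resamples rather than taking percentiles of a single resample; a minimal sketch on synthetic stand-in values (the actual sulphates column is not reproduced here):

```python
import numpy as np

# Synthetic stand-in for the sulphates column: skewed, mean around 0.63
rng = np.random.default_rng(1)
sulphates = rng.gamma(shape=9.0, scale=0.07, size=1359)

# Percentile bootstrap: resample with replacement, recompute the mean each time
boot_means = [
    rng.choice(sulphates, size=len(sulphates), replace=True).mean()
    for _ in range(2000)
]
lo, hi = np.percentile(boot_means, [2.5, 97.5])
print(f"95% bootstrap CI for the mean: ({lo:.3f}, {hi:.3f})")
```

Because the statistic is recomputed per resample, the resulting interval is narrow (on the order of twice the standard error), unlike the wide percentile range of the raw values.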

Statistical Hypotheses:
Null Hypothesis (H0): Sulphates levels do not affect wine quality.
Alternative Hypothesis (H1): Sulphates levels affect wine quality (positively or negatively; a two-sided alternative).

Hypothesis Testing:
To assess the association between sulphates content and wine quality, the Spearman rank correlation coefficient was calculated. The resulting coefficient of 0.381 indicates a moderate positive correlation between sulphates content and wine quality. Furthermore, the associated p-value is extremely small (approximately $4.43 \times 10^{-48}$), well below conventional significance thresholds.

In [26]:
corr, p_value = spearmanr(dm["sulphates"], dm["quality"])

print(
    f"Spearman Rank Correlation: {corr}, p-value: {p_value} for Sulphates Level vs Wine Quality."
)

r_squared = corr**2  # square of the Spearman coefficient computed above
print("Coefficient of Determination (R-squared): %.3f" % r_squared)
Spearman Rank Correlation: 0.3805814248370784, p-value: 4.432379281577418e-48 for Sulphates Level vs Wine Quality.
Coefficient of Determination (R-squared): 0.145

Model Fitting:

To model the relationship between sulphates level and wine quality, we use ordinal logistic regression, since the dependent variable (wine quality) is ordinal. The fitted model yields a coefficient for sulphates whose exponential is an odds ratio: the multiplicative change in the odds of a higher quality rating for each one-unit increase in sulphates.

In [27]:
model = mord.LogisticAT(alpha=0)
model.fit(dm[["sulphates"]], dm["quality"])

# mord.LogisticAT stores the ordinal thresholds (cutpoints) in theta_;
# the slope coefficient itself is in coef_
print("First threshold: %.3f" % model.theta_[0])
print("Second threshold: %.3f" % model.theta_[1])
First threshold: -3.102
Second threshold: -1.217
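As noted above, an ordinal logistic coefficient translates into an odds ratio by exponentiation. A one-line sketch with a hypothetical slope value (not the actual fitted coefficient):

```python
import numpy as np

# Hypothetical slope from an ordinal logistic fit (illustrative value only)
b = 2.0
odds_ratio = np.exp(b)  # multiplicative change in odds per unit increase in sulphates
print(f"Odds ratio per one-unit increase: {odds_ratio:.2f}")
```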

Conclusion:

Based on the statistical analyses and confidence intervals:

  • The p-value is $4.43 \times 10^{-48}$, practically zero and far below 0.05, so we conclude that the correlation between sulphates level and wine quality is statistically significant. However, statistical significance does not always imply practical significance: the squared Spearman coefficient is 0.145, meaning that approximately 14.5% of the variability in the quality ranks can be explained by sulphates. The remaining 85.5% is due to factors not included in the model or to inherent randomness.
  • The two values printed for the ordinal logistic regression model, -3.102 and -1.217, come from model.theta_, which in mord's LogisticAT holds the ordinal thresholds (the cutpoints separating adjacent quality categories on the latent scale) rather than an intercept and a slope; the sulphates effect itself is stored in model.coef_.
  • The sign of the fitted sulphates coefficient gives the direction of the effect: a positive coefficient means that each one-unit increase in sulphates raises the log odds of a higher quality rating. Given the moderate positive Spearman correlation (0.381) found above, a positive coefficient is expected; reading -1.217 as a negative sulphates effect conflates a threshold with the coefficient.

Influence of Volatile Acidity Level on Wine Quality

Given the nature of the analyzed data, with quality being an ordinal variable and volatile acidity being continuous and not normally distributed, a non-parametric Spearman rank correlation test was conducted to analyze whether volatile acidity levels significantly influence wine quality in the dataset.

Target Population:

The target population consists of all wine types available in the dataset.

Significance Levels:

The chosen significance level for hypothesis testing was α = 0.05.

Confidence Intervals:

Since the data are not normally distributed, we calculate the confidence interval using bootstrapping: the data are repeatedly resampled with replacement, the mean is computed on each resample, and the percentile method is applied to the resulting means. The interval [0.52, 0.54] means that if we repeated the sampling many times, about 95% of the intervals constructed this way would contain the true mean volatile acidity; in other words, we are 95% confident that the true mean volatile acidity level in the population of wines lies between 0.52 and 0.54.

In [28]:
lower, upper = bootstrap_ci(dm["volatile acidity"], 1000, np.mean, 0.05)
print(
    f"The 95% confidence interval for the mean volatile acidity is ({lower:.2f}, {upper:.2f})"
)
The 95% confidence interval for the mean volatile acidity is (0.52, 0.54)
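The bootstrap_ci helper is imported during notebook setup and is not shown here; a minimal sketch of such a helper, matching the call signature used above (data, number of resamples, statistic, alpha), might look like:

```python
import numpy as np

def bootstrap_ci(data, n_boot, stat_func, alpha, seed=0):
    """Percentile-bootstrap confidence interval for stat_func applied to data."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data)
    stats = [
        stat_func(rng.choice(data, size=len(data), replace=True))
        for _ in range(n_boot)
    ]
    return np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# Illustrative usage on synthetic, skewed stand-in data
sample = np.random.default_rng(2).exponential(scale=0.53, size=1000)
lower, upper = bootstrap_ci(sample, 1000, np.mean, 0.05)
print(f"95% CI for the mean: ({lower:.2f}, {upper:.2f})")
```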

Statistical Hypotheses:
Null Hypothesis (H0): Volatile acidity levels do not affect wine quality.
Alternative Hypothesis (H1): Volatile acidity levels affect wine quality (positively or negatively; a two-sided alternative).

Hypothesis Testing:
To assess the association between volatile acidity and wine quality, the Spearman rank correlation coefficient was calculated. The resulting coefficient of -0.39 indicates a moderate negative correlation between volatile acidity and wine quality. The associated p-value is effectively zero (it displays as 0.00 at two decimal places), far below conventional significance thresholds.

In [29]:
corr, p_value = spearmanr(dm["quality"], dm["volatile acidity"])
print(f"Spearman's rank correlation is {corr:.2f} with a p-value of {p_value:.2f}")

r_squared = (-0.39) ** 2  # square of the rounded Spearman rho printed above
print("Coefficient of Determination (R-squared): %.3f" % r_squared)
Spearman's rank correlation is -0.39 with a p-value of 0.00
Coefficient of Determination (R-squared): 0.152
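An interval estimate for Spearman's rho itself can be obtained by bootstrapping paired resamples; a minimal sketch on synthetic stand-in data (negatively related, like volatile acidity and quality):

```python
import numpy as np
from scipy.stats import spearmanr

# Synthetic stand-in: an acidity-like variable negatively related to quality
rng = np.random.default_rng(3)
va = rng.uniform(0.2, 1.2, 500)
quality = np.round(7.0 - 2.5 * va + rng.normal(0.0, 0.8, 500))

# Percentile bootstrap of rho over paired resamples
boots = []
for _ in range(1000):
    idx = rng.integers(0, len(va), len(va))
    r, _p = spearmanr(va[idx], quality[idx])
    boots.append(r)
lo, hi = np.percentile(boots, [2.5, 97.5])
rho, _ = spearmanr(va, quality)
print(f"rho = {rho:.2f}, 95% bootstrap CI ({lo:.2f}, {hi:.2f})")
```

Resampling rows as pairs preserves the joint structure of the two variables, which is what makes the interval valid for a correlation.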

Model Fitting:

To model the relationship between volatile acidity and wine quality, we again use ordinal logistic regression, since the dependent variable (wine quality) is ordinal. The fitted model yields a coefficient for volatile acidity whose exponential is an odds ratio: the multiplicative change in the odds of a higher quality rating for each one-unit increase in volatile acidity.

In [30]:
model = mord.LogisticAT(alpha=0)
model.fit(dm[["volatile acidity"]], dm["quality"])

# theta_ holds the ordinal thresholds (cutpoints); the slope is in coef_
print("First threshold: %.3f" % model.theta_[0])
print("Second threshold: %.3f" % model.theta_[1])
First threshold: -7.857
Second threshold: -5.879

Conclusion:

Based on the statistical analyses and confidence intervals:

  • The p-value (displayed as 0.00 at two decimal places) is far below the 0.05 significance level, so the correlation is statistically significant and we reject the null hypothesis of no correlation between the variables. In other words, there is strong evidence of a significant negative correlation between wine quality and volatile acidity. The squared Spearman coefficient is 0.152, meaning that approximately 15.2% of the variability in the quality ranks can be explained by volatile acidity; the remaining variability is due to factors not included in the model or to inherent randomness.
  • The two values printed for the ordinal logistic regression model, -7.857 and -5.879, come from model.theta_, which in mord's LogisticAT holds the ordinal thresholds (the cutpoints separating adjacent quality categories on the latent scale) rather than an intercept and a slope; the volatile acidity effect itself is stored in model.coef_.
  • The sign of the fitted volatile acidity coefficient gives the direction of the effect: a negative coefficient means that as volatile acidity increases, the odds of a higher quality rating decrease, which aligns with the negative Spearman correlation (-0.39) found above.

⇡¶

Conclusions

⇡¶

Insights and Findings:

  1. Quality Ratings Overview:

    • The dataset contains 6 distinct quality ratings, ranging from 3 to 8.
    • The most common quality rating observed is 5, with a frequency of 681 instances.
    • Analysis reveals interesting correlations between wine characteristics and quality ratings.
  2. Data Exploration Highlights:

    • The exploration of physicochemical properties indicates several key insights:
      • Significant correlations observed between alcohol content and wine quality.
      • Analysis of sulphates and volatile acidity also shows notable impacts on wine quality.
      • Outliers identified but deemed essential for comprehensive exploratory data analysis (EDA).
  3. Statistical Analyses and Models:

    • Regression models (e.g., Gradient Boosting) demonstrate potential predictors of wine quality.
    • Correlation tests (e.g., Kendall's Tau) establish statistically significant relationships between variables.
    • Findings suggest practical implications for stakeholders in wine production and management.

⇡¶

Recommendations for Action:

  1. Optimizing Alcohol Content:

    • Based on analyses, adjusting alcohol levels could enhance perceived wine quality.
    • Implementing targeted changes in alcohol content may lead to improved consumer reception.
  2. Sulphates Management:

    • Controlling sulphates levels could positively impact overall wine quality.
    • Stakeholders should consider optimizing sulphates usage in production processes.
  3. Quality Assurance Strategies:

    • Develop robust quality assurance protocols integrating identified insights.
    • Regularly monitor key physicochemical properties to maintain consistent wine quality.

⇡¶

Further Areas for Investigation:

  1. Exploring Volatile Acidity Effects:

    • Investigate deeper into volatile acidity's impact on wine quality.
    • Identify precise thresholds where volatile acidity begins to significantly affect perception.
  2. Consumer Perception Studies:

    • Conduct surveys or focus groups to align quality perceptions with analytical findings.
    • Bridge the gap between analytical insights and consumer preferences for targeted improvements.
  3. Longitudinal Data Analysis:

    • Gather longitudinal data to track changes in wine characteristics over time.
    • Monitor trends to adapt production strategies and meet evolving market demands.